Efficient Computation of Statistical Significance of Query Results in Databases
Abstract
Queries such as database similarity searches return results that satisfy certain properties of distances or scores. For domain scientists, the absolute values of the scores are seldom sufficient. The statistical significance, or p-value, of a result is a more useful criterion. It can be computed using an appropriate model of random objects. Computing p-values becomes harder when queries have multiple components, because the returned score is then an aggregate of individual scores. The simple approach of calculating the p-value by enumerating all random possibilities fails for large database and query sizes. We propose an efficient method to calculate the approximate p-value of a multi-attribute result when the distribution of scores for the database objects is non-parametric. Experimental evaluation on large databases shows that our method is practical, runs 5 orders of magnitude faster than the basic approach, and has an error of less than 5% in p-value computation.

1 Motivation and Problem Statement

Many database systems retrieve results based on some distance or score measure between the query object and the database objects. A score is a quantitative measure of the similarity between objects based on multiple attributes. It has been widely used for ranking results in content-based multimedia retrieval systems. However, with the growing interest in analyzing the results of a database similarity query, computing rigorous statistical properties of the results is more meaningful. Statistical significance helps domain scientists understand the nature of the query and the statistical properties of the database objects. The best-known example is BLAST [1]. A standard measure of statistical significance is the p-value. The p-value of a score s of a query result from a database is defined as the probability of randomly obtaining a result from the database with a score of s or higher for the same query.
It is the area, over scores of s or higher, under the probability distribution function (pdf) of the scores of random objects. For a database management system (DBMS) serving single-object queries, the score pdf can be characterized or calculated, and so the p-value can be computed. However, there are database systems of complex objects in which each object consists of multiple attributes or components. Such systems support queries with multiple attributes or objects, and the score of a result is some aggregate function (e.g., sum) of the individual scores of each query component against its corresponding result component [5]. These queries are common in region-based image retrieval (RBIR) systems [3] and information retrieval systems [9]. For example, in an RBIR system, a query region is composed of a number of sub-regions (e.g., tiles) [2, 10]. The database images are also split into sub-regions. Each component sub-region has a corresponding score of its match with a query sub-region. The score of a result is the sum of the individual scores.

For a given query object Q of size r, a random database for computing the p-value can be modeled by considering all possible aggregates of size r composed of components from the database. To find the p-value, we need to calculate the score pdf for this random database. This simple method has a running time that grows exponentially with database size and query size and is, therefore, impractical.

Algorithm PRUNE
Input: Query $Q = \bigcup_{i=1}^{r} Q_i$, Score $s$, Database $D$, Number of bins $b$
Output: P-value $p$
1. for $i = 1$ to $r$
2.     $D_i$ := 1-NN($Q_i$, $D$)
3.     $h_i$ := BinHistogram($D_i$, $b$)
4. end for
/* $\sigma_i$ is the sum pdf of bin histograms $1, \dots, i$ */
5. $\sigma_1 := h_1$
6. for $i = 2$ to $r$
7.     $B(\sigma_i) := s - \sum_{j=i+1}^{r} \max(h_j)$
8.     $B(h_i) := B(\sigma_i) - \max(\sigma_{i-1})$
9.     $B(\sigma_{i-1}) := B(\sigma_i) - \max(h_i)$
10.     $\sigma_i$ := Convolute(all bins $\sigma_{i-1,j} \ge B(\sigma_{i-1})$, all bins $h_{i,k} \ge B(h_i)$)
11. end for
12. $p$ := Sum of probabilities in all bins $\sigma_{r,j} \ge s$
Fig. 1. The PRUNE algorithm.
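The pruning bounds in Fig. 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each histogram is a list whose index k is an integer score bin and whose value is the probability of that bin, and it skips the 1-NN and BinHistogram steps (lines 1 to 4 of the figure). The function name `prune_p_value` and the sparse dictionary used for the sum pdf are hypothetical choices made for this illustration.

```python
def prune_p_value(hists, s):
    """Sketch of lines 5-12 of Fig. 1 under the integer-bin assumption.

    hists[i][k] is the probability that component i scores k (assumed layout).
    Returns the probability that the summed score is at least s.
    """
    r = len(hists)
    # Upper bound on the score of each component: its highest bin index.
    max_h = [len(h) - 1 for h in hists]
    # sigma is the (sparse) sum pdf of histograms 0..i; start with h_1.
    sigma = {k: p for k, p in enumerate(hists[0]) if p > 0}
    for i in range(1, r):
        # B(sigma_i): smallest partial sum that can still reach s.
        b_sigma = s - sum(max_h[i + 1:])
        max_prev = max(sigma) if sigma else 0
        # B(h_i): smallest useful bin of the current histogram.
        b_h = b_sigma - max_prev
        # B(sigma_{i-1}): smallest useful bin of the previous sum pdf.
        b_prev = b_sigma - max_h[i]
        new_sigma = {}
        for j, pj in sigma.items():
            if j < b_prev:
                continue  # pruned: cannot reach s even with max later scores
            for k, pk in enumerate(hists[i]):
                if k < b_h or pk == 0:
                    continue  # pruned bin of h_i
                new_sigma[j + k] = new_sigma.get(j + k, 0.0) + pj * pk
        sigma = new_sigma
    # Line 12: total mass in bins at or above s.
    return sum(p for k, p in sigma.items() if k >= s)


if __name__ == "__main__":
    example = [[0.2, 0.3, 0.5], [0.1, 0.9], [0.25, 0.25, 0.25, 0.25]]
    print(prune_p_value(example, 5))
```

Under these assumptions, the pruned bins are exactly those that cannot take part in any combination reaching s, so the mass reported at or above s matches a full convolution of the same binned histograms. Any approximation error therefore comes from the binning itself, not from the pruning.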
In this paper, we propose and solve the following problem: “Given a query $Q$ composed of $r$ objects $Q_i$, $i = 1, \dots, r$, database objects $D_j$, $j = 1, \dots, n$, and scoring functions $f_i : Q_i \times D \to \mathbb{R}$, compute the p-value of obtaining a score $s$ for a result $R = \bigcup_{i=1}^{r} R_i$, where $s = \sum_{i=1}^{r} f_i(Q_i, R_i)$, for a random database of objects, each having $r$ component objects.”

Methods have been proposed for obtaining a single measure of statistical significance by combining the individual p-values. For example, the method in [4] requires finding the correlation among the attributes, which is done by sampling for large datasets. We adopt a more direct approach. We find the sum pdf of the individual pdfs of the components of the query, and then calculate the p-value from this sum score pdf. Since the score pdfs of the components are independent of one another, the sum pdf is the convolution of all the individual pdfs. For most databases, the nature and the parameters of this pdf cannot be computed. We consider the cases in which the probability distribution function of the cumulative scores is non-parametric.
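As a small illustration of why the sum pdf is a convolution, the sketch below compares direct enumeration of all component combinations with repeated convolution. It is a toy example under stated assumptions: scores are non-negative integers, each component pdf is a list indexed by score, and the function names are hypothetical.

```python
from itertools import product


def convolve(p, q):
    """Discrete convolution of two pdfs indexed by integer score."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def p_value_by_convolution(pdfs, s):
    """P(sum of independent component scores >= s) via the sum pdf."""
    total = [1.0]  # pdf of an empty sum: score 0 with probability 1
    for pdf in pdfs:
        total = convolve(total, pdf)
    return sum(total[max(s, 0):])


def p_value_by_enumeration(pdfs, s):
    """Same quantity by enumerating every combination (exponential in r)."""
    p = 0.0
    for combo in product(*[range(len(pdf)) for pdf in pdfs]):
        if sum(combo) >= s:
            prob = 1.0
            for pdf, k in zip(pdfs, combo):
                prob *= pdf[k]
            p += prob
    return p


if __name__ == "__main__":
    pdfs = [[0.5, 0.5], [0.5, 0.5]]
    print(p_value_by_convolution(pdfs, 1), p_value_by_enumeration(pdfs, 1))
```

Both functions give the same value on the same input, but the enumeration visits every combination of bins, while the convolution only combines two pdfs at a time.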